Statistical modelling in R for ecologists (Part 2)
Here we build upon the modelling skills developed in Part 1 of this course. In Part 1 we learned how to fit models to data in a frequentist paradigm.
Here we will fit the same models in a Bayesian paradigm. Then we will build more complex, multivariate generalised linear models, and demonstrate when Bayesian approaches might be preferred over frequentist ones.
We’ll compare multivariate generalised linear models (i.e., model-based ordination) to other algorithmic approaches typically used for analysing ecological community data (e.g., nMDS).
Finally, we’ll show how these techniques can be used to model species distributions, and how R can handle spatial data.
Diving into the fundamental and nuanced differences between frequentist and Bayesian statistics is beyond the scope of this short course, but you can find some nice, short overviews of the key differences here and here.
From my point of view, both are useful. Often, I start by fitting frequentist models to my data (like we did yesterday) and then move to more complex Bayesian models if needed.
Why?
Bayesian models require more ‘number crunching’ to estimate parameters (coefficients), and therefore can take a long time to fit compared to their frequentist analogues.
Bayesian models require us to define ‘priors’ for our model parameters. If we don’t have much prior information on model parameters (i.e., our priors are ‘uninformative’, which is usually the default), the coefficients estimated by Bayesian models should be the same as their frequentist analogues.
Sometimes though, we need to go Bayesian. For example, if we want to model spatio-temporal autocorrelation in our data, we might like to use Bayesian models in the R-INLA package.
Another advantage of Bayesian models is that parameters are treated as random variables rather than fixed quantities. In practical terms this means that, rather than a single point estimate for each model coefficient (+/- confidence intervals), we get an entire probability distribution of plausible values for each coefficient (referred to as the posterior distribution). This provides a richer understanding of our coefficients, and therefore of the ecological relationships we are trying to estimate.
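To make "an entire probability distribution per coefficient" concrete, here is a minimal base-R sketch (using made-up count data, not data from this course) that computes the posterior for a single Poisson mean by grid approximation:

```r
set.seed(42)
y <- rpois(20, lambda = 5)             # 20 hypothetical abundance counts
lambda_grid <- seq(0.1, 15, by = 0.01) # candidate values for the Poisson mean
log_lik <- sapply(lambda_grid, function(l) sum(dpois(y, l, log = TRUE)))
prior <- dunif(lambda_grid, 0, 15)     # a flat, uninformative prior
post_unnorm <- exp(log_lik - max(log_lik)) * prior
posterior <- post_unnorm / sum(post_unnorm * 0.01) # normalise to a density
# the posterior is a whole distribution; its mean is one possible summary
post_mean <- sum(lambda_grid * posterior * 0.01)
```

With a flat prior the posterior mean lands essentially on the sample mean of the data, which is the same intuition behind the point above: uninformative priors give frequentist-like estimates.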
Here, we’ll go Bayesian because our ultimate goal is to fit a complex multivariate generalised linear model and account for spatial autocorrelation in our observations by including a spatial random effect.
Before we do though, we need to learn a little bit more about how Bayesian models estimate parameters.
Bayesian vs. frequentist parameter estimation
In part 1 of this course we touched on how model parameters (coefficients) are estimated with either ordinary least squares (general linear models) or maximum likelihood (generalised linear models).
The Bayesian models we’ll use today use Markov Chain Monte Carlo (MCMC) sampling to estimate the posterior distribution for each parameter (coefficient).
We don’t have time to discuss the details today, but the key things you need to know when fitting a Bayesian model with MCMC sampling are:
MCMC sampling is an algorithm that results in a chain of samples that form a probability distribution which, in our case, is the posterior distribution for each of our model parameters. Read more here
We have to set the following MCMC sampling parameters:
the number of chains (typically 4),
how many samples of the posterior to obtain in each chain (typically 1000),
the thinning rate (how often do we keep samples in each chain, e.g., a thinning rate of 1 means we keep every sample, a rate of 5 means we keep every 5th sample),
the length of the transient (also called ‘burn-in’) - this is how many initial samples we discard from each chain.
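These settings combine into the total number of iterations the sampler must run. A quick sanity check of the arithmetic, using the thin = 100 settings we end up needing later in this tutorial:

```r
chains <- 4; samples <- 1000; thin <- 100; transient <- 2500
iterations_per_chain <- transient + samples * thin # total iterations run per chain
total_posterior_draws <- chains * samples          # draws actually kept across all chains
iterations_per_chain  # 102500
total_posterior_draws # 4000
```

So a higher thinning rate makes the sampler run many more iterations to retain the same 1000 samples per chain, which is why the fits below get slower.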
After we set the MCMC sampling parameters and fit the model, we need to check that the MCMC sampling has converged. This means that, in addition to checking traditional structural model assumptions for linear models like we learned in Part 1 (i.e., normality and homoskedasticity of residuals), we also need to check MCMC convergence diagnostics before making any inference or predictions from our model.
Okay, now we’re ready to try fitting a Bayesian model.
General(ised) linear models in a Bayesian framework
In Part 1, we fitted general and generalised linear models in a frequentist framework. (Remember general linear models assume a normal error structure (think tree circumference as our response variable (\(y\))), whereas generalised linear models can accommodate non-normal error structures (think species counts or presence-absence as the response \(y\))).
Let’s try fitting the same models in a Bayesian framework using MCMC sampling for parameter (coefficient) estimation. We’ll use weak, uninformative priors (the default), meaning our coefficient estimates from the Bayesian model should be the same as the Frequentist.
In the interest of time, we’ll just repeat the generalised linear model of species abundance in relation to fire severity. But if you’re keen, feel free to try the general linear model of tree circumference (you’ll just need to specify a normal (Gaussian) distribution instead of a Poisson).
First, let’s simulate the data we need for species abundance and fire severity (repeating exactly as we did in Part 1).
library(ggplot2)

set.seed(123) # set seed for random number generator to ensure results are reproducible
b0 <- log(10) # average species abundance when fire severity = 0 (i.e., intercept) is 10
b1 <- -log(1/0.5) # species abundance decreases by a factor of 2 (aka 50% decrease in abundance) for a one unit increase in fire severity
n <- 100 # number of observations
X <- runif(n, min = 0, max = 1) # fire severity measurements
lambda <- exp(b0 + b1*X) # estimate mean species abundance (lambda) on the multiplicative scale using the exponential
y <- rpois(n, lambda) # simulate a species abundance observation from each lambda
df <- data.frame(Fire_severity = X, Species_abundance = y) # make a dataframe of observations

ggplot(df) +
  aes(x = Fire_severity, y = Species_abundance) +
  geom_point() +
  ylab('Species abundance') +
  xlab('Fire severity') +
  geom_smooth(method = 'lm') +
  #geom_smooth(method = 'loess') +
  theme_classic()
`geom_smooth()` using formula = 'y ~ x'
Fitting a Bayesian GLM and checking MCMC convergence
Now we can fit the model. We’ll use {Hmsc} to do this - a flexible Bayesian modelling framework for fitting complex spatial models with random effects. ‘Hmsc’ is an acronym for ‘Hierarchical Modelling of Species Communities’. We can use it to fit univariate models (as we are doing right now) and multivariate models (as we’ll do later).
Let’s load the package and fit a Bayesian generalised linear model where our response \(y\) is species abundance and our explanatory variable \(x\) is fire severity.
library(Hmsc)
Loading required package: coda
# set up the model
m <- Hmsc(Y = as.matrix(df$Species_abundance),
          XData = df,
          XFormula = ~ Fire_severity,
          distr = 'poisson')

# fit the model with MCMC sampling
m_fit <- sampleMcmc(m, nChains = 4, samples = 1000, thin = 1, transient = 2500, verbose = F)
setting updater$GammaEta=FALSE due to absence of random effects included to the model
So, as you can see, this takes a bit longer than our frequentist GLMs. Now we want to check whether the MCMC sampling has converged.
First, we’ll extract the posterior distributions for the coefficients estimated by our model (i.e., the y-intercept and the beta coefficients).
mpost <-convertToCodaObject(m_fit)
Now we can look at trace plots of the MCMC sampling for each of the posterior distributions to see whether they have converged. We expect to see random mixing of the four MCMC chains (a fuzzy caterpillar), and bell-shaped, unimodal posterior distributions.
plot(mpost$Beta)
Not ideal. The chains are not mixing very well, and the posterior distributions are not bell-shaped.
We can also get some quantitative convergence diagnostics, such as the effective sample size of the MCMC chains, to help us diagnose the issue. We would expect the effective sample size to be close to 4000 (1000 samples/chain * 4 chains). If it is much less, that indicates there is autocorrelation in the MCMC sampling.
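In a {coda} workflow this check is typically `effectiveSize(mpost$Beta)`. As a self-contained illustration of why autocorrelation shrinks the effective sample size, here is a base-R sketch with a hypothetical, strongly autocorrelated AR(1) chain:

```r
set.seed(1)
n <- 4000; rho <- 0.9 # 4000 draws with strong autocorrelation between successive draws
x <- numeric(n)
for (i in 2:n) x[i] <- rho * x[i - 1] + rnorm(1)
# for an AR(1) chain, the effective sample size is approximately n * (1 - rho) / (1 + rho)
ess_approx <- n * (1 - rho) / (1 + rho)
ess_approx # ~211: these 4000 correlated draws carry the information of only ~200 independent ones
```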
Whoa, much less - we have an issue with autocorrelation in the sampling.
We can also check the Gelman potential scale reduction factors (psrf), which should be close to 1, indicating the individual chains give similar results.
gelman.diag(mpost$Beta,multivariate=FALSE)$psrf
Point est. Upper C.I.
B[(Intercept) (C1), sp1 (S1)] 1.009974 1.029671
B[Fire_severity (C2), sp1 (S1)] 1.013335 1.034058
Higher than 1. This confirms what we saw above in the trace plots - the chains do not mix well.
Let’s try increasing the thinning rate to improve convergence of the MCMC sampling. This means the model will take a lot longer to fit - it’s doing a lot more sampling because it only keeps every 10th sample, to reach a total of 1000 retained samples per chain.
So I’m going to ask it to run the MCMC chains in ‘parallel’ to speed it up using the argument nParallel.
# fit the model with MCMC sampling
#m_fit <- sampleMcmc(m, nChains = 4, samples = 1000, thin = 10, transient = 2500, verbose = F, nParallel = 4)
#saveRDS(m_fit, 'model_10thinning.rds')
m_fit <- readRDS('model_10thinning.rds')

# extract posterior distributions for beta coefficients
mpost <- convertToCodaObject(m_fit)
plot(mpost$Beta)
Looks better, what about our quantitative diagnostics?
Point est. Upper C.I.
B[(Intercept) (C1), sp1 (S1)] 1.004772 1.016165
B[Fire_severity (C2), sp1 (S1)] 1.003701 1.012686
Improved - the psrf values indicate the chains are providing similar results (as we can see in the trace plots), but the effective sample size is still low. Let’s bump up the thinning a bit more.
# fit the model with MCMC sampling
#m_fit <- sampleMcmc(m, nChains = 4, samples = 1000, thin = 100, transient = 2500, verbose = F, nParallel = 4)
#saveRDS(m_fit, 'model_100thinning.rds')
m_fit <- readRDS('model_100thinning.rds')

# extract posterior distributions for beta coefficients
mpost <- convertToCodaObject(m_fit)
plot(mpost$Beta)
Excellent, even better than before. What do our quantitative diagnostics say now?
Point est. Upper C.I.
B[(Intercept) (C1), sp1 (S1)] 1.0003794 1.003090
B[Fire_severity (C2), sp1 (S1)] 0.9997679 1.001205
Much better. We’re now happy that our MCMC sampling has converged, meaning that, as long as our structural model assumptions are met, we can make inference from this model. Note, however, that sometimes you might notice that earlier samples in your trace plots look different from later samples. In this case you can try increasing the transient (i.e., the number of initial samples that are discarded).
Checking structural model assumptions
Now we want to check the same model assumptions we have when fitting frequentist GLMs: the model residuals are normally distributed and have homogeneity of variance.
Unfortunately our handy plot function we used in Part 1 won’t work for {Hmsc} model objects. So we have to calculate the model residuals ourselves and plot them.
preds <- computePredictedValues(m_fit) # array of fitted (predicted) values for species abundance for each of the 4000 MCMC samples
Fitted.mean <- apply(preds, FUN = mean, MARGIN = 1) # get the mean predicted species abundance across the 4000 samples
Residuals <- scale(df$Species_abundance - Fitted.mean) # calculate standardised residuals

par(mfrow = c(1,2)) # set graphical parameters
plot(Fitted.mean, Residuals); abline(a = 0, b = 0) # plot the fitted vs. residuals
qqnorm(Residuals); qqline(Residuals) # qq plot
Looks very similar to the diagnostic plots we made in Part 1, where we fitted a frequentist GLM to the same data.
Now let’s check the coefficient estimates are close to our known truths.
summary(mpost$Beta)
Iterations = 2600:102500
Thinning interval = 100
Number of chains = 4
Sample size per chain = 1000
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
B[(Intercept) (C1), sp1 (S1)] 2.3526 0.07106 0.001124 0.001123
B[Fire_severity (C2), sp1 (S1)] -0.7703 0.13709 0.002168 0.002197
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
B[(Intercept) (C1), sp1 (S1)] 2.213 2.305 2.3532 2.399 2.491
B[Fire_severity (C2), sp1 (S1)] -1.038 -0.862 -0.7712 -0.676 -0.506
Remember, these estimates are on the log scale, so we need to back-transform them to the natural scale. As you can see, they are almost exactly the same as those from our frequentist model!
exp(summary(mpost$Beta)$statistics[1]) # back transform the y-intercept (mean species abundance when fire severity = 0)
[1] 10.51245
exp(summary(mpost$Beta)$statistics[2]) # back transform the beta coefficient (multiplicative change in abundance for a one unit increase in fire severity)
[1] 0.4628578
Multivariate Generalised linear model
Okay, so now we know how to fit a Bayesian GLM, and we can see that with default, uninformative priors it produces the same coefficient estimates as its frequentist analogue. It took longer to fit though, and there were more diagnostics to check (MCMC convergence in addition to standard structural model assumptions). So why go Bayesian?
Well, in the previous example, we might opt to stick with a frequentist GLM in the interest of saving time. But, what if we had observations of multiple species’ abundances that we wanted to model together? We can build upon our Bayesian model to do so, and include a site-level random effect to estimate residual correlations in species’ abundances (i.e., the correlations in species’ abundances left unexplained after accounting for the variability explained by explanatory variables such as fire severity).
You might be thinking, this sounds like a constrained ordination. You would be right! Fitting multivariate GLMs is analogous to fitting a constrained ordination. The difference is that the former is a model-based approach to ordination (which means we can do things like account for lack of independence in our samples due to spatial autocorrelation, as we’ll do later), and the latter is an algorithmic approach to ordination.
For interest, we’ll fit an algorithmic unconstrained ordination (nMDS) to a multivariate dataset of species’ abundances and compare this to a multivariate GLM.
Let’s start by simulating a new dataset of species abundance data for multiple species. We’ll also make this a spatial example with longitude (x) and latitude (y) coordinates for each location where species are surveyed.
library(MASS)

n <- 100 # number of survey locations
ns <- 5 # number of species
b0 <- log(rep(10, ns)) # intercepts: average abundance of 10 for every species when fire severity = 0
b1 <- log(c(2, 1.1, 1, 1.1, 2)) * c(-1, -1, 0, 1, 1) # fire severity effects: negative, weakly negative, none, weakly positive, positive
beta <- cbind(b0, b1)
X <- cbind(rep(1, n), runif(n, min = 0, max = 1)) # fire severity measurements
xycoords <- data.frame(x = runif(n, 152.1, 152.5), y = runif(n, 28.1, 28.5)*-1) # survey coordinates
sigma.spatial <- c(2) # spatial variance
alpha.spatial <- c(0.35) # range of the spatial autocorrelation
Sigma <- sigma.spatial^2 * exp(-as.matrix(dist(xycoords))/alpha.spatial) # spatial covariance matrix
eta1 <- mvrnorm(mu = rep(0, n), Sigma = Sigma) # latent spatial variable
lambda1 <- c(1, 2, -2, -1, 0) # species loadings on the latent spatial variable
spatial_res <- eta1 %*% t(lambda1) # spatially structured residuals
lambda <- exp(X %*% t(beta)) + spatial_res # expected species abundances
y <- data.frame(apply(data.frame(lambda), 2, function(x) rpois(n, x))) # simulate abundance observations
colnames(y) <- c('spp1', 'spp2', 'spp3', 'spp4', 'spp5')
df <- data.frame(xycoords, y, Fire_severity = X[,2])
head(df)
Before we try model-based ordination of our multivariate species abundance data, let’s try a quick exploratory nMDS to look for patterns.
library(vegan)
nmds <- metaMDS(df[,3:7], distance = 'bray', k = 2)
nmds
Call:
metaMDS(comm = df[, 3:7], distance = "bray", k = 2)
global Multidimensional Scaling using monoMDS
Data: wisconsin(df[, 3:7])
Distance: bray
Dimensions: 2
Stress: 0.1852856
Stress type 1, weak ties
Best solution was repeated 5 times in 20 tries
The best solution was from try 4 (random start)
Scaling: centring, PC rotation, halfchange scaling
Species: expanded scores based on 'wisconsin(df[, 3:7])'
The stress is an estimate of the goodness of fit of the nMDS: it measures how well the ordination preserves the rank order of the original dissimilarities. Values less than 0.2 generally indicate a decent fit.
Let’s plot the ordination and see which survey locations are similar based on species abundance and composition. First we need to extract the site scores and species scores for the two nMDS dimensions.
library(dplyr)
So this shows the patterns we expected. We could do a permutational analysis of variance (PERMANOVA) to formally test whether species abundance and composition differ between fire classes.
adonis2(nmds_scores[,3:7] ~ Fire_class, data = nmds_scores)
Permutation test for adonis under reduced model
Terms added sequentially (first to last)
Permutation: free
Number of permutations: 999
adonis2(formula = nmds_scores[, 3:7] ~ Fire_class, data = nmds_scores)
Df SumOfSqs R2 F Pr(>F)
Fire_class 1 0.26271 0.0967 10.49 0.001 ***
Residual 98 2.45415 0.9033
Total 99 2.71686 1.0000
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
However, the inference we can make regarding effect sizes for individual species, and the strength of the associations between species after accounting for fire severity, is limited.
So let’s try model-based ordination with a multivariate GLM instead and see if we gain any extra inference.
Note that there are many other approaches to algorithmic ordination than we’ve covered here. nMDS is a type of ‘unconstrained’ ordination, but there are other approaches for ‘constrained’ ordination which are a closer analogue to the model-based ordination we’ll try next.
Model-based ordination
As before, we’ll set up our model and fit it with MCMC sampling. The main difference is this time we include a site-level random effect that models the residual correlations across species as latent (unobserved) variables.
# set up random effect
studyDesign <- data.frame(sample = as.factor(1:nrow(df)))
rL <- HmscRandomLevel(units = studyDesign$sample)

# set up the model
m <- Hmsc(Y = as.matrix(df[,3:7]),
          XData = df,
          XFormula = ~ Fire_severity,
          distr = 'poisson',
          studyDesign = studyDesign,
          ranLevels = list(sample = rL))

# fit the model with MCMC sampling
#m_fit <- sampleMcmc(m, nChains = 4, samples = 1000, thin = 100, transient = 2500, verbose = F, nParallel = 4)
#saveRDS(m_fit, 'model_100thinning_mult.rds')
m_fit <- readRDS('model_100thinning_mult.rds')
mpost <- convertToCodaObject(m_fit)
plot(mpost$Beta)
Okay, so convergence diagnostics look good. Let’s take a look at the structural model assumptions.
preds <- computePredictedValues(m_fit) # array of fitted (predicted) values for species abundance for each of the 4000 MCMC samples
Fitted.mean <- apply(preds, FUN = mean, MARGIN = c(1,2)) # get the mean predicted species abundance across the 4000 samples
Residuals <- scale(df[,3:7] - Fitted.mean) # calculate standardised residuals

par(mfrow = c(5,2)) # set graphical parameters
for (i in 1:ncol(Residuals)) {
  Fitted.mean.spp <- Fitted.mean[,i]
  Residuals.spp <- Residuals[,i]
  plot(Fitted.mean.spp, Residuals.spp, main = i); abline(a = 0, b = 0) # plot the fitted vs. residuals
  qqnorm(Residuals.spp, main = i); qqline(Residuals.spp) # qq plot
}
Species 1 looks okay, but there are some strong patterns in the residuals vs. fitted plots for species 2-5. What’s going on here? The patterns might be due to another process influencing species abundance in our data that we haven’t accounted for yet - spatial autocorrelation.
Whenever you’re fitting a model to spatial data, you should check whether the residuals are spatially auto-correlated. We’ll do that visually by mapping the residuals.
Variable(s) "abundance" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
We can see that ‘green’ positive residuals tend to cluster in some places, and likewise for ‘red’ negative residuals, indicating that there is spatial autocorrelation in where our model tends to over- or under-predict.
Before we learn more about how to deal with this, let’s check whether our coefficient estimates are biased away from the known truths. Let’s remind ourselves what those were for each species.
So there is definitely some bias, although generally the signs of the effects are correct (i.e., negative vs. positive) and the discrimination is good; e.g., fire severity has a larger negative effect on the abundance of species 1 than species 2, and a larger positive effect for species 5 than species 4.
Let’s see if we can improve our model by including a spatial random effect to account for spatial autocorrelation.
Accounting for spatial autocorrelation
What is spatial autocorrelation? It’s Tobler’s first law of geography, ‘Everything is related to everything else, but near things are more related than distant things’.
This means that observations of species abundance that are closer together in space are more likely to be similar than observations made farther apart. This means our observations are not independent of each other - which is an assumption of general(ised) linear models.
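The spatial random effect encodes exactly this distance decay: each pair of sites gets a covariance that shrinks exponentially with the distance between them, just as in the `Sigma` matrix we used to simulate the data. A tiny numeric sketch with the same sigma and alpha values as the simulation:

```r
sigma <- 2; alpha <- 0.35
d <- c(0, 0.1, 0.5, 1)            # distances between pairs of sites (degrees)
covs <- sigma^2 * exp(-d / alpha) # exponential distance decay of covariance
covs # full variance (sigma^2 = 4) at distance 0, decaying towards 0 for distant pairs
```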
We can include a spatial random effect that will estimate the spatial autocorrelation in our species abundance observations. Let’s try it.
# set up spatial random effect
studyDesign <- data.frame(sample = as.factor(1:nrow(df)))
rL.spatial <- HmscRandomLevel(sData = df[,c(1,2)])
rL <- HmscRandomLevel(units = studyDesign$sample)

# set up the model
m <- Hmsc(Y = as.matrix(df[,3:7]),
          XData = df,
          XFormula = ~ Fire_severity,
          distr = 'poisson',
          studyDesign = studyDesign,
          ranLevels = list(sample = rL.spatial))

# fit the model with MCMC sampling
#m_fit <- sampleMcmc(m, nChains = 4, samples = 1000, thin = 100, transient = 2500, verbose = F, nParallel = 4)
#saveRDS(m_fit, 'model_100thinning_mult_spatial.rds')
m_fit <- readRDS('model_100thinning_mult_spatial.rds')
mpost <- convertToCodaObject(m_fit)
plot(mpost$Beta)
Now let’s see if our residual diagnostic plots look better.
preds <- computePredictedValues(m_fit) # array of fitted (predicted) values for species abundance for each of the 4000 MCMC samples
Fitted.mean <- apply(preds, FUN = mean, MARGIN = c(1,2)) # get the mean predicted species abundance across the 4000 samples
Residuals <- scale(df[,3:7] - Fitted.mean) # calculate standardised residuals

par(mfrow = c(5,2)) # set graphical parameters
for (i in 1:ncol(Residuals)) {
  Fitted.mean.spp <- Fitted.mean[,i]
  Residuals.spp <- Residuals[,i]
  plot(Fitted.mean.spp, Residuals.spp, main = i); abline(a = 0, b = 0) # plot the fitted vs. residuals
  qqnorm(Residuals.spp, main = i); qqline(Residuals.spp) # qq plot
}
Much better! Now check spatial autocorrelation in the residuals.
Variable(s) "abundance" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.
Now, let’s see if our coefficient estimates are closer to the known truths.
We could also compare the explanatory power of the two models; it should be considerably higher with the spatial random effect included.
So, what’s the advantage over algorithmic ordination approaches? With model-based ordination we can account for spatial autocorrelation and get accurate and precise estimates of the effect of fire severity on the abundance of each species.
We can also plot the residual correlations between species. These could arise because species interact with one another (e.g., predator-prey interactions), because we’re missing some important explanatory variables in our model, or because of Tobler’s first law (i.e., spatial autocorrelation).
So, after accounting for variability in species abundance due to fire severity, we infer that abundances of species 1 are positively correlated with abundances of species 2, and negatively correlated with species 3 and 4. This is exactly what we expected, given how we simulated spatial autocorrelation in the data.
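We can check this expectation directly against the simulation: a single latent factor with species loadings lambda implies a residual covariance among species of lambda %*% t(lambda), so the signs of the pairwise products give the expected residual correlations. A sketch using the loadings from our simulation code:

```r
lambda1 <- c(1, 2, -2, -1, 0)   # species loadings used in the simulation above
Omega <- lambda1 %*% t(lambda1) # implied residual covariance among the 5 species
Omega[1, 2:5] # species 1 vs 2-5: positive, negative, negative, zero
```

The signs match the inference: species 1 is positively associated with species 2, negatively with species 3 and 4, and (in truth) not at all with species 5.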
Unexpectedly, there is a positive residual correlation between species 1 and 5, but if we look at the support for that correlation, it is less than 95%.
We now extract the posterior means of the \(\eta\) and \(\lambda\) parameters, which represent the site loadings (\(\eta\)) and species loadings (\(\lambda\)) on the latent factors, to plot an ordination of sites that are similar in species composition and abundance after accounting for fire severity. Note this will look different from the nMDS ordination we made earlier: what we’ve done here is a constrained ordination, whereas the nMDS is an unconstrained ordination.
Now that we’ve removed the variability in species abundances due to fire severity, we can see that site species composition and abundance are no longer associated with fire severity.
Predicting and mapping species abundance everywhere
Now that we are happy with our spatial, multivariate model of species abundance we might like to use it to make predictions of expected species abundance with fire severity everywhere in our survey area.
First we will generate a raster of expected fire severity by interpolating between our observations of fire severity. As a reminder, we simulated fire severity at our survey locations, and it looks like this.
qtm(df.sf, dots.col ='Fire_severity')
Let’s start by making a grid covering our survey area, which we can then use for interpolation. A spatial grid is what we often refer to as ‘raster’ data (the points above are referred to as ‘vector’ data). {terra} is a good R package for working with spatial raster data.
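As a minimal sketch, a regular grid of prediction points covering the simulated extent can be built in base R (the 0.01-degree resolution is an arbitrary choice for illustration); {terra} can then turn such points, once they have interpolated values attached, into a raster:

```r
grid <- expand.grid(x = seq(152.1, 152.5, length.out = 41),  # longitudes across the survey extent
                    y = seq(-28.5, -28.1, length.out = 41))  # latitudes across the survey extent
nrow(grid) # 41 x 41 = 1681 prediction points covering the survey area
```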
library(terra)
Now we do the interpolation, using thin plate splines to predict fire severity everywhere.
library(fields)